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Least squares fitting is in general not useful for high-dimensional 
' linear models, in which the number of predictors is of the same or 

even larger order of magnitude than the number of samples. Theory 
^-j' developed in recent years has coined a paradigm according to which 

, sparsity-promoting regularization is regarded as a necessity in such 

setting. Deviating from this paradigm, we show that non-negativity 
constraints on the regression coefficients may be similarly effective 
as explicit regularization. For a broad range of designs with Gram 
matrix having non-negative entries, we establish bounds on the £2- 
prediction error of non-negative least squares (NNLS) whose form 
qualitatively matches corresponding results for £i-regularization. Un- 
der slightly stronger conditions, it is established that NNLS followed 
, by hard thresholding performs excellently in terms of support re- 

. covery of an (approximately) sparse target, in some cases improv- 

ing over £i-regularization. A substantial advantage of NNLS over 
regularization-based approaches is the absence of tuning parameters, 
which is convenient from a computational as well as from a practi- 
tioner's point of view. Deconvolution of positive spike trains is pre- 
^ . sented as application. 

m 
in 

ON . 

, 1. Introduction. Consider the linear regression model 

O: (1.1) y = X(3*+e, 

where y is a vector of observations, X 6 M. nxp a design matrix, e a vector 
of noise and (3* a vector of coefficients to be estimated. Throughout this pa- 
^ ' per, we are concerned with a high-dimensional setting in which the number of 

5— i ■ unknowns p is at least of the same order of magnitude as the number of obser- 

vations n, i.e. p = 0(n) or even p n, in which case one cannot hope to recover 
the target /3* if it does not satisfy one of various kinds of sparsity constraints, 
the simplest being that (3* is supported on S = {j : /3* 7^ 0}, \S\ = s < n. In 
this paper, we additionally assume that /3* is non- negative, i.e. /J* S R+. This 
constraint is particularly relevant when modelling non-negative data, which 
emerge e.g. in the form of pixel intensity values of an image, time measure- 
ments, histograms or count data, economical quantities such as prices, incomes 
and growth rates. Non-negativity constraints occur naturally in numerous de- 
convolution and unmixing problems in diverse fields such as acoustics [29], 
astronomical imaging [2], hyperspectral imaging [1], genomics [28], proteomics 
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[39], spectroscopy [20] and network tomography [31]; see [12] for a survey. As 
reported in these references, non-negative least squares (NNLS) yields at least 
reasonably good, sometimes even excellent results in practice, which may seem 
surprising in view of the simplicity of that approach. The NNLS problem is 
given by the quadratic program 

(1.2) mbI||,,-X/J|5, 

which is a convex optimization problem that can be solved efficiently [27]. 
Solid theoretical support for the empirical success of NNLS from a statistical 
perspective scarcely appears in the literature. An early reference is [20] dat- 
ing back already two decades. The authors show that, depending on X and 
the sparsity of j3*, NNLS may have a 'super-resolution'-property that permits 
reliable estimation of f3* . Rather recently, sparse recovery of non- negative sig- 
nals in a noiseless setting {e = 0) has been studied in [5, 19, 48, 49]. One 
important finding of this body of work is that non-negativity constraints alone 
may suffice for sparse recovery, without the need to employ sparsity-promoting 
^i-regularization as usually. The main contribution of the present paper is a 
transfer of this intriguing result to a more realistic noisy setup, contradicting 
the well-established paradigm that regularized estimation is necessary to cope 
with high dimensionality and to prevent over-adaptation to noise, thereby im- 
proving the understanding of the empirical success of NNLS. More specifically, 
we demonstrate that under suitable conditions on X, which are fulfilled, for 
example, if X has exclusively non-negative entries (a condition which is of- 
ten met naturally for the applications listed above), hard thresholding of a 
minimizer j3 of (1.2) is an effective device for support recovery that may even 
improve over ^-regularized least squares (lasso, [41]). Due to its combination 
of favourable computational and statistical properties, the lasso enjoys most 
popularity among various sparsity-promoting regularization schemes that have 
been proposed in recent years. As surveyed in [45], a series of strong theoret- 
ical guarantees concerning prediction, estimation of (3* and feature selection 
with the lasso have been established under various conditions on the design, 
some of which apply rather broadly. On the other hand, NNLS, as a pure fit- 
ting approach, avoids shortcomings of £i-regularization. The bias caused by 
the regularization term negatively affects estimation of (3* with respect to the 
^oo-norm and the performance for feature selection [52, 53, 56]. While different 
remedies have been proposed, such as subsequent thresholding for improved 
selection [33, 54], or the adaptive lasso [56], which addresses the bias, all these 
schemes, like any regularization-based scheme, entail careful tuning of at least 
one parameter. On the contrary, this is not required for NNLS and constitutes 
a definite advantage, particularly from a practical point of view. Parameter 
tuning by cross-validation increases the computational burden and may be 
error-prone if done by a grid search, since the grid could have an unfavourable 
range or a too small number of grid points. Apart from that, the use of cross- 
validation may not always be adequate in a fixed design setup, e.g. for the 
application in [39]. 

Outline and contributions of the paper. The paper significantly extends a pre- 
vious conference publication [40]. Meinshausen has independently studied the 
performance of NNLS in a high-dimensional setting. His paper [31] contains 
an oracle property with respect to prediction and an £i-bound for estimating 
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the coefficient vector. That work is related to ours in Section 2.5. Prediction 
performance is studied in Section 2, while the focus of our paper is on sparse 
recovery (Section 3). In particular, we study sparse recovery in an ^-sense 
and support recovery by thresholding. It is shown that the threshold can be 
chosen in an entirely data-driven way requiring only a single fit to the data as 
opposed to cross-validation. 

We develop a common geometric framework hinging on large-margin separat- 
ing hyperplanes within which the conditions required for our main results can 
be interpreted. 

In Section 4, theory is complemented by illustrations and numerical results. 
It is outlined that NNLS can be useful for spike train deconvolution. Second, 
the performance with regard to sparse recovery compared to standard methods 
such as the lasso is investigated. 

Proofs of statements not proven in the main manuscript are contained in the 
appendix. 

Notation. We denote the usual ^ g -norm by |H| q . The cardinality of a set is 
denoted by | • |. Let J, K C {1, . . . , m} be index sets. For a matrix A £ R nxm , 
Aj denotes the matrix one obtains by extracting the columns corresponding 
to J. For j = 1, . . . ,m, Aj denotes the j-th column of A, while A 3 denotes 
the j-th row. The matrix Ajx is the sub-matrix of A by extracting rows in 
J and columns in K. Likewise, for v € M m , vj is the sub- vector correspond- 
ing to J. The identity matrix is denoted by / and vectors of ones by 1. The 
symbols ^, y and -<, y denote componentwise inequalities and componentwise 
strict inequalities, respectively. In addition, for some matrix A, A y a means 
that all entries of A are at least equal to a scalar a. The non- negative orthant 
{x G W 71 : x y 0} is denoted by M™. The standard simplex in R m , that is the 
set {x S : YlJLi x j = 1} i s denoted by T m . Lower and uppercase c's 
like c, d, ci, c, c and C,C ,C±,C,C etc. denote positive constants not depend- 
ing on the sample size n whose values may differ from line to line. In general, 
the positive integers p = p n and s = s n depend on n. Landau's symbols are 
denoted by o(-), O(-), ^(-)- Asymptotics is to be understood w.r.t. a triangular 

array of observations {(X n , y n ), X n £ R nXpn }, n = 1,2, 

Normalization. If not stated otherwise, the design matrix X is considered as de- 

1 1 2 

terministic, having its columns normalized such that ||.Xj|| 2 = n, j = 1, . . . ,p. 
General linear position. We say that the columns of X are in general linear 
position in M n if the following condition (GLP) holds 

(1.3) (GLP): VJC {l,...,p}, \J\ <n: Xj\ = => A = 0. 

2. Prediction error. In the following section, NNLS is studied from the 
point of view of prediction. We first present several negative examples, where 
NNLS breaks down or overfits similarly as its unconstrained counterpart. We 
then derive a sufficient condition on X which allows us to show consistency 
of NNLS in a high-dimensional setting. Linking that condition to the given 
examples, we identify a transition regarding resistance to overfitting within 
the set of design matrices. 

2.1. Minimum requirement on the design. In general, the non-negativity con- 
straints in (1.2) may not be meaningful at all, given the fact that any least 
squares problem can be reformulated as a non-negative least squares problem 
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with an augmented design matrix [X — X]. More generally, NNLS can be as 
ill-posed as least squares if the following condition (T~L) does not hold. 

(2.1) {%) : 3w € ]R n such that X T w y 0. 

Condition (%) requires the columns of X be contained in the interior of a half- 
space containing the origin. If (H) fails to hold, G conv{Xj}^ =1 , so that there 

are infinitely many minimizers /3 of the NNLS problem (1.2). If additionally 
p > n and the columns of X are in general linear position (condition (GLP) in 
(1.3)), must be in the interior of conv{Xj}^ =1 . It then follows that C = M. n , 
where C = cone{Xj}? =1 denotes the polyhedral cone generated by the columns 
of X. Consequently, the non-negativity constraints become vacuous and NNLS 
yields perfect fit for any observation vector y. In light of this, (H) constitutes 
a necessary condition for a possible improvement of NNLS over ordinary least 
squares. 

The half-space constraint (H) is fulfilled, for example, if a subset of the rows 
of X takes only positive or only negative values, or if all columns of X are 
positively correlated, i.e. if all inner products of pairs of columns are non- 
negative. 

On the contrary, random matrices having entries drawn i.i.d. from a zero-mean 
sub-Gaussian distribution, as they are typically considered in the compressed 
sensing literature [10, 11, 13] fail to satisfy (H) with high probability iip/n > 2, 
which follows from Wendel's Theorem as pointed out in [19]. 

Theorem 1. (Wendel, [37, 50]) 

Let Xi, . . . ,X p be i.i.d. random points in M. n whose distribution is symmetric 
with respect to and which assigns measure zero to every hyperplane through 
0. Then for X = [X\ . . . X p ], it holds that 

(2.2) P (X obeys (H) ) = P(0 < B < n - 1), 

where B follows a binomial distribution with p — 1 trials and probability of 
success equal to \ . 



For X as in above theorem, upper bounding (2.2) by means of Hoeffding's 
inequality implies that for p/n > 2, 

(2.3) P (x obeys (H) ) < exp (-n (p/n - 2) 2 /2) . 

2.2. Overfitting under condition {%). Since NNLS is a pure fitting approach, 
over-adaptation to noise is a natural concern. Resistance to overfitting can 
be quantified in terms of ^||^/3||2 when y is a vector of Gaussian noise. The 
following two examples show that in general, condition {%) is not sufficient 
to ensure that ^H^/^Hl = with high probability. In Theorem 2 below, 
we consider random design whose realizations can be interpreted as generic 
representatives of the class of design matrices fulfilling condition (%)■ 

Theorem 2. Let y = e, where e has i.i.d. standard Gaussian entries, and 
( v T \ 

let X = I I be independent of y, where v is an n-dimensional random 
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vector with i.i.d. entries from a zero-truncated standard Gaussian distribution 
and U is a random matrix of dimension (n — 1) xp independent of v having 
i.i.d. standard Gaussian entries. Then, there exists a constant Cq so that if 
p/n > Co > 2, it holds that ~||X/3||| = S7(l) with probability at least 1 — 
C exp(— cn), C, c > 0. 



Overfitting with orthonormal design. Suppose that X is orthonormal, i.e. 
X T X = nip. It is easy to see that NNLS also overfits in this case. Let y = e 
be a standard Gaussian random vector as above. The optimal coefficients can 
be given in closed form. We have that 

Pj = m&x{Xje,0}/ \\Xj\\l , j = 1, . . . ,p, 

so that the distribution of each component of /3 is given by a mixture of 
a point mass 0.5 at zero and a half-normal distribution. We conclude that 
= ~\\j3\\l is of the order 0(p/n) with high probability. The fact that 
X is orthonormal is much stronger than the obviously necessary half-space 
constraint (7i). In fact, as rendered more precisely below, orthonormal design 
turns out to be at the edge of the set of designs still leading to overfitting. 



2.3. Designs with a 'self-regularizing' property. We now derive a sufficient 
condition X has to satisfy so that overfitting is prevented. That condition is 
referred to as self -regularizing property, expressing the fact that the design 
itself automatically generates a regularizing term. Denote by £ = ^X T X the 
(scaled) Gram matrix of the columns of X. The NNLS problem for y = e can 
then be written as 

(2.4) min --e T X/3 + /3 T S/3. 

/3^o n 

Over-fitting can only be prevented by the quadratic term. Assuming that X 
satisfies {%), there exists a unit vector w and h y such that X T w = h. 
Denote by 11^, = ww T the orthogonal projection on w. We then have for all 
P h 

pT^p = P T x T xp = p T x T u w xp + p T x T (i-u w )xp 

n n n 

> P t X^ww' t XP > (/i T /3) 2 > h min T 2 
n ~ n ~ n ' 

where /i m i n is the minimum entry of h, i.e. the quadratic term in /3 can be 
lower bounded by its squared ^i-norm multiplied by some factor. The following 
condition results by optimizing that factor. 

Condition 1. A design X with Gram matrix £ = \X~^ X is said to have a 
self-regularizing property if there exists a universal constant tq > such that 

(2.5) /3 T S/3 > r 2 (l T /3) 2 V/3 h 0. 

For given X, Tq can be obtained numerically as the optimal value of the 
quadratic program 

(2.6) min -||XA|||, T" -1 = {A G R p , : 1 T A = 1}, 
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which equals the squared Euclidean distance of the convex hull of the columns 
of X, scaled by to the origin. Alternatively, to can be interpreted as margin 
of a maximum-margin separating hyperplane with normal vector w, which is 
determined from the optimization problem dual to (2.6): 

(2.7) To = max r subject to —=X T w y t1, and ||n>|| 2 < 1. 



The geometry as well as the duality of (2.6) and (2.7) parallels the construction 
principle of separating hyperplanes in the context of support vector machines, 
cf. e.g. [38], Sec. 7.2. 

Condition (2.5) does not only rule out overfitting. It additionally gives rise 
to the following general bound on the ^-prediction error of NNLS, where the 
assumption that the linear model is specified correctly, is not made. Instead, we 
only assume that there is a fixed target / = (/i, . . . , f n ) T to be approximated 
by a non- negative combination of the columns of X. 

Theorem 3. Let y = f + e, where f S M n is fixed and e has i.i.d. zero-mean 
sub- Gaussian entries with parameter a. Define 

S* = mm-\\XP-f\\l S = -\\XP-f\\l 
p>o n n 

Suppose that X satisfies Condition 1. Then, with probability no less than 
1 — 2/p, it holds that 



(2.8) e<r+ 1211/3111 : 16V H + 

V r o y V n t 2 n 

for all p* eargmm±||X/3-/|||. 

2.4. Discussion. 

Discussion of Condition 1. One might ask whether instead of Condition 1, 
which provides a lower bound on the quadratic form /3 T S/3 for all non- negative 
vectors in terms of their ^i-norm, one could employ the following alternative: 
There exists a universal constant n > so that 

/3 T S/3> ?? ||/3||2 V/3>:0. 

However, this condition is in general too weak, since it is satisfied for orthonor- 
mal design with rj = 1, for which NNLS overfits. On the other hand, orthonor- 
mal design does not obey Condition 1, because it leads to Tq = 1/n in (2.5). 
Likewise, the random design of Theorem 2 does not obey Condition 1 with 
high probability. In fact, with the same notation as in the theorem, one has 

n n 

By Wendel's Theorem 1, for p > 2.5n, zero is contained in the convex hull of 
the columns of U with probability close to 1, so that Tq cannot be larger than 
IMIoo/ n ' which scales as 0(log(p)/n) with high probability. These observations 
suggest that the scaling of Tq in (2.5) is a reasonable indicator for potential 
resistance to overfitting. 



6 



Examples of designs satisfying Condition 1. 

Entry-wise positive lower bound on the Gram matrix. If X y o~q > 0, then 
(2.5) holds with Tq = o~o- In particular, this is fulfilled if the columns of X 
are all strictly contained in the interior of an orthant of W n . This example 
suggests that orthonormal design marks the transition between overfitting and 
resistance to overfitting, respectively. 

Band- or block structure with a positive lower bound. If E y and if the set 
of predictors indexed by {1, . . . ,p} can be partitioned into blocks B\, . . . , Bk 
such that mini<KK Xb, h o"o> then 

K -. K 
mm/3 T S/3 > £ l3\ -X^XbM > a ^(1 T Psf > ^(l T /3) 2 . 
«=i l=i 

In particular, as sketched in Figure 1, this applies to design matrices whose en- 
tries contain the function evaluations at points {itj}™ =1 C [a, b] of non-negative 
functions such as splines, Gaussian kernels and related 'localized' functions 
traditionally used for data smoothing. If the points {ui}f =1 are placed evenly 
in [a, b] then the corresponding Gram matrix effectively has a band structure. 
For instance, suppose that u, = i/n, i = l,...,n, and consider indicator 
functions of sub-intervals <f>j{u) = I{u G [(fXj — h) V a, (/ij + h) A 6]}, where 
Hj G [0,1], j = and h = 1/K for some positive integer K. Setting 

X = (4>j(ui))i<i< n) i<j< P and partitioning the {fij} by dividing [0, 1] into inter- 
vals [0,h], (h,2h], (1 — h, 1] and accordingly B\ = {j : fj,j G ((/ — l-h]}, 
I = 1,...,K, we have that mini</<^ -X^ Xb 1 h h such that property (2.5) 
holds with r 2 = h/K = 1/K 2 . 




Fig 1. Block partitioning of 15 Gaussians into K = 5 blocks. The right part shows the 
corresponding pattern of the Gram matrix. 



Discussion of Theorem 3. In Theorem 3, it is established that the ^-prediction 
error of NNLS approaches that of the optimal non-negative combination of the 
columns of X provided 11/3*11} = o(y / n/ dog p), which implies that NNLS can 
be consistent in a regime in which the number of predictors is nearly expo- 
nential in n. In that sense, NNLS constitutes a 'persistent procedure' in the 
spirit of Greenshtein and Ritov [25] who coined the notion of 'persistence' as 
distinguished from classical consistency with a fixed number of predictors. The 
authors study persistence of the lasso estimator , defined as a minimizer of 

(2.9) min — \\y - Xf3\\\ + A , 
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where A > is the regularization parameter. Interestingly, our Theorem 3 
can be matched to a corresponding result for the lasso. If the linear model is 
correct, i.e. if / = X(3* with (3* y and consequently £* = 0, the leading 
term of the bound (2.8) is of the order 0(11/3*1^ y / log(p)/n). The same rate — 
without any assumption on the design — is readily obtained for the lasso, using 
an argument that can essentially be found in [6]. Since P^ 1 is a minimizer of 

(2.9) , one has 

-\\x(^ - nwi + iii < -e T x(^ —/?*)+ a uriix • 

n n 
Choosing A = 2Ao, Ao = ^||X T e||oo, applying Holder's inequality to the right 
hand side, using \\/3 1 — < \\i + \\(3*\\i and re-arranging, it follows that 
i||X(^ - < 3A \\P*\\ V With e as in Theorem 3, A = 0{^\og{p)/n) 

with high probability, such that one ends up with a bound qualitatively equal 
to (2.8). 

In [26] , the problem of finding an optimal (in the sense of -^-prediction error) 
convex combination out of a set of given regression functions is considered, 
which, in our setting would correspond to the additional constraint l T /3 = 1. 
When setting = 1 in (2.8), the rate becomes qualitatively equal to the 

corresponding result in [26]. 

Remark. NNLS has been introduced as a tool for 'non- negative data'. In this 
context, the assumption of zero-mean noise in Theorem 3 is questionable. In 
case that the entries of e have a positive mean, one can decompose e into a 
constant term, which can be absorbed into the linear model, and a second term 
which has mean zero. 

2.5. Near-optimal rate under the 'compatibility condition'. In this subsection, 
we relate our work to the recent paper by Meinshausen [31] who has indepen- 
dently studied the performance of NNLS for high-dimensional linear models. 
In [31], our Condition 1, which is termed 'positive eigenvalue condition' there, 
is combined with an assumption of an underlying sparse linear model and the 
'compatibility condition' [44, 45], which yields a bound on ||/3 — /3*||i and an 
improved rate for the -^-prediction error in the flavour of sparsity oracle in- 
equalities previously derived for the lasso, see e.g. [3, 7, 9, 44]. These works 
show that the lasso may adapt to the underlying sparsity, since, under appro- 
priate assumptions on the design, the ^2-prediction error may be only of the 
order 0(s \og(p)/n), where s is the number of non-zero entries of j3* . Apart 
from a logarithmic factor, that is what could be achieved if the support of /?* 
were known in advance. In the sequel, we state a result of that type. 

Condition 2. [45] Let S C {1, . . . ,p}, \S\ = s and a > 1. Define 

1l{a,S) = {SeW:\\Sso\\ 1 <a\\5 s \\ 1 }. 

We say that the design satisfies the (a, S)- compatibility condition if there exists 
a constant 4>(a, S) such that 

(2.10) 5 T E5 > s~ l <j){a, S)\\5 S \\\ V5 G K(a, S). 

The compatibility condition is slightly weaker than the corresponding restricted 
eigenvalue condition [3] which applies to random sub-Gaussian matrices for s 
sufficiently small in a broad sense [55]. 
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Theorem 4. Assume that y = X(3* +e, where (3* >z has support S, \S\ = s, 
e has i.i.d. sub- Gaussian entries with sub- Gaussian parameter a. Further as- 
sume that X satisfies Condition 1 as well as the (3 / Tq , S)- compatibility condi- 
tion. It then holds that 

with probability no less than 1 — 2/p. 

A similar result with better constants is shown in [31]. However, in [31] a lower 
bound on the minimum non-zero coefficient of (3* is required, which is not 
natural and does not appear in corresponding results for the lasso. One can 
deduce sparse recovery by thresholding as an immediate corollary of Theorem 
4, which provides an ^i-bound for the estimation of (3* . Considerably stronger 
results concerning sparse recovery are implied by the ^-bounds of the follow- 
ing second part of the paper. 

3. Sparse recovery. Within this section, we focus on the situation where 
j3* is at least approximately sparse. We argue that NNLS may achieve a near- 
optimal rate for estimating (3* in sup-norm, which makes thresholding an ef- 
fective device for support recovery. 




Fig 2. Left panel: Cone C — A'R^ C R+ for p = 6. Right panel: A slice of C. The columns 
of X are indicated by round dots. The face containing y is depicted by a bold black segment 
which is part of a solid line representing the separating hyperplane with corresponding normal 
vector w. 

3.1. Geometry. For the rest of the paper, we assume that the columns of X are 
in general linear position, i.e. that condition (GLP) in (1.3) holds. We start by 
studying the noiseless (e = 0) and exactly sparse case. Figure 2 provides some 
intuition for the problem on the basis of geometric ideas used and developed 
in [15, 16, 19, 48]. The left panel displays a polyhedral cone C = cone{Xj}^ =1 , 
p = 6 contained in In the given example, the vector y (represented by a 
square) is contained in the boundary of C, whereas y' (represented by a triangle) 
is contained in the interior of C. Consider now the two underdetermined linear 
systems of equations with non-negativity constraints 

y = X(3 sb.t. pyo, y' = X0 sb.t. >z 0. 
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It is clear from the picture that f3 is determined uniquely, whereas P' is not. The 
observation to be made here is that y is contained in a two-dimensional face 
of C. In general, given F C {1, . . . , p}, F ^ 0, Cf = cone{Xj}j £ F, which is the 
polyhedral cone generated by the sub-matrix Xp, is called a face of C if there 
exists w £ M. n such that {Xj,w) = 0, j G F, and (Xj,w) > 0, j ^ F. In other 
words, there exists a hyperplane containing the origin separating the cones 
Cf and Cf c - Given this definition, we conclude with the following statement 
regarding support recovery in the noiseless and exactly sparse case. 

Proposition 1. Let y = Xj3* , where /3* y has support S, \S\ = s. The 
constrained linear system X(3 = y sb.t. f3 y 0, has f3* as its unique solution 
if Cs is a face of C. Conversely, if Cs is not a face of C, there exists j3* >z 
supported only on S such that uniqueness fails to hold. 



Proof. By definition, since Cs is a face of C, there is a hyperplane separating 
Cs from Cgc, i.e. there exists w £ W 1 such that (Xj,w) = 0, j G S, (Xj,w) > 
0, j E S c . Assume that there is a second solution f3*+5, 5^0. Expand Xs(/3 S + 
5s) + XscSs c = V- Multiplying both sides by w T yields X^es c (^j> w ) $j = 0- 
Since /3J C = 0, feasibility requires 5j > 0, j G S c . All inner products within the 
sum are positive, concluding that 5s<: = 0. General position of the columns of 
X (1.3) implies that 5s = 0. 

If Cs is not a face of C, then the convex hulls generated by {Xj}j^s and 
{Ajljggc, respectively, have a non-empty intersection. □ 



To achieve robustness in the presence of noise, we strengthen the notion of a 
face by quantifying the amount of separation. The following definition naturally 
extends (2.7). 

Definition 1. Let S C {1, . . . ,p}, S ^ 0, be given. The separating hyper- 
plane constant ( with respect to S) is defined as optimal value of the quadratic 
program 



t(S) = maxr 

T,W 

(3.1) subject to —=XgW = 0, — ^=Xg c w >z rl, ||u>|| 2 < 1- 

\/n \ n 



Clearly, t(S) > if and only if Cs is a face of C. In view of 

1 „T . -T 1 

1 " 



n —=X~Jw > t <J=^> min \ T —=XJ c w > r, 



where T p ~ s ~ 1 = {A G R p + s : A T 1 = 1} is the standard simplex in W~ s , it 
is easy to see that t(S) can be defined equivalently in terms of the following 
optimization problem dual to (3.1). 



(3.2) r(5) = „ _ mm 1 — \\X S 9 - X S o\\\ 2 , 



1 

i.e. t(S) equals the distance of the subspace spanned by the support columns 
to the convex hull of the off-support columns. In the following, we denote by 
Tig and the orthogonal projections on the subspace spanned by {Xj}j<=s 
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and its orthogonal complement, respectively, and set Z = HgXs c - We can then 
compactly rewrite (3.2) as 

(3.3) t 2 (S) = min -||ZA||^ = min X T -Z T ZX. 

AgTP _ s _i n AgTP _ s _i n 

From (3.3), we see that the separating hyperplane constant is nothing else 
than the constant ro in Condition 1, applied with respect to the matrix Z 
and the linear space orthogonal to that spanned by the support columns. This 
observation will permit us to carry over parts of the reasoning underlying the 
results of the preceding section to establish robustness of sparse recovery. 



3.2. A general robust sparse recovery result. Equipped with the separating 
hyperplane constant, we are in position to the state the following result, which 
asserts robustness of sparse recovery with respect to additive noise and ap- 
proximate sparsity. Writing (3f^ > . . . > P* p ^ > for the sequence of ordered 
coefficients, let S = {j : Pj> P? s -\} be the set of the s largest coefficients of (3* 
(for simplicity, assume that there are no ties). For the results that follow, we 
think of \\(3gc ||i being considerably smaller than the entries of f5* s . The standard 
case of exact sparsity in which S equals the support of j3* is covered by setting 
||/3£ c ||i = 0. In order to state the following Theorem 5, the following quantities 
are required. 
(3.4) 

(3min(S) = min/3*, K(S) = max HE^Hco, <Pmm(S) = min ||S 5>S "y|| 2 . 
i e5 IMIoo =1 IMI 2 =1 

Theorem 5. Given the linear model y = X(3* + e, where j3* >z and e has 
i.i.d. zero-mean sub-Gaussian entries with sub-Gaussian parameter a. Set 



If /3min(5') > b, then the NNLS estimator (3 has the following properties with 
probability no less than 1 — A/p: 

WsA\i<b and \\h-Ps\\oo < b. 



Roughly speaking, Theorem 5 guarantees that if the columns correspond- 
ing to S are 'sufficiently separated' from those in S c as quantified by t 2 (S), 
then ||/3s , c|| 1 can be upper bounded in terms of ||/3g c ||i, the noise level and 
t 2 (S), provided that the entries of of f3* s are all large enough. Since, as dis- 
cussed below, t 2 (S) cannot be expected to be larger than f2(s _1 ) and K(S) = 
\fs {(ftminiS)}' 1 (cf. (3.4)) in the worst case, the bound on \\/3s — ^g\\oo is 
sub-optimal because of its dependence on s. This will be improved in Section 
3.6. 



3.3. Analysis by decomposition. Theorem 5 follows from a decomposition of 
the NNLS problem into two sub-problems corresponding to S and S c , respec- 
tively. The latter is tackled by using the separating hyperplane constant. That 
decomposition, which is detailed below, serves as our main proof technique 
that is used for several subsequent results as well. 
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Key lemmas. The first lemma is immediate from the KKT optimality condi- 
tions of the NNLS problem. Its proof is hence omitted. 

Lemma 1. (3 is a minimizer of (1.2) if and only if there exists F C {1, . . . ,p} 
such that 

-Xj(y-XB) = 0, and % > 0, j e F, -Xj(y-X(3) < 0, and % = 0, j € P c . 

Lemma 1 implies that the NNLS estimator (3 can be obtained from the following 
restricted least squares problem, once an active set F is known: 

I 

min — \\y — X/3||| subject to /3fc = 0. 

Note that F can always be chosen so that \F\ < n, and that F is determined 
uniquely whenever y ^ C (and the columns of X are in general linear position 
(1.3)), in which case Xj3, the projection of y on C, is contained in the boundary 
of C, i.e. in a lower-dimensional face Cf of C. 

The next lemma is crucial, since it permits us to decouple f3$ from (3sc 
Lemma 2. Consider the two non-negative least squares problems 
(PI): min i||£ + Zfc - Z/3 (P1) ||1, £ = IL^e, Z = U^X S c 

(P2) : min -||n 5 £ + U s X S cp* c + X s f3* s - X s f3^ - U s Xs^ {P1) \\ 2 2 

with minimizers 

^i) of {PI) andft p V of(P2), respectively. If (3^ y 0, then 
setting B s = /3( P2 ) and (3 S c = /3 (P1) yields a minimizer (3 of the non-negative 
least squares problem (1.2). 



Proof. The NNLS objective is split into two parts in the following way: 
(3.5) 

- \\y - XP\\1 = - \\Usy - X s Ps - U s X S od S of 2 + - 1|£ + Z(3* s , - Zp s °\\l , 
n n n 

Separate minimization of the second summand on the r.h.s. of (3.5) yields 
(3( P1 \ Substituting 8^ P1 ^ for ftgc in the first summand, and minimizing the 
latter amounts to solving (P2). In view of Lemma 1, if B^ P2 ^ y 0, it coincides 
with the unconstrained least squares estimator (3.5) corresponding to prob- 
lem (P2). This implies that the optimal value of (P2) must be zero, because 
the observation vector TLsy of the non-negative least squares problem (P2) is 
contained in the column space of Xg. Since the second summand in (3.5) cor- 
responding to (PI) cannot be made smaller than by separate minimization, we 
have minimized the non-negative least squares objective. The result follows by 
expanding U s y = U s e + n 5 X S c/3* + X S P* S . □ 



Proof of Theorem 5. 
Consider problem (PI) of Lemma 1. 
Step 1: Controlling ||/3 (pi) ||i via t 2 (S). 
feasible for (PI), we have 



since is a minimizer and is a 



^ + zi3* Sc -zp s 4 2 2 <-u + zp* Sc 
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which implies that 
(3.6) 

0^) T -Z T zd iP1) <\\d iP1) h(A + 2 1 -Z T Zft c ), A= max -\Zj(\. 



i<3<(p-«) n 



< ||^ P1 )|| 1 (A + 2| 



As observed in (3.3), t 2 (S) = min^g^p-s-i \ T ±Z T Z\, s.t. the l.h.s. can be 
lower bounded via 

(3.7) 0^) T -Z T Z^ p V > r 2 (5)||^ pl >||?. 

n 

Combining (3.6) and (3.7), we have ||/3 (pi) ||i < (A + 2||/3* c ||i)/r 2 (5). 
Step 2: Back-substitution into (P2). Equipped with the bound just derived, 
we insert /3( P1 ) into problem (P2) of Lemma 2, and show that in conjunction 
with the assumptions made for the minimum support coefficient /3 m i n (S), the 
ordinary least squares estimator corresponding to (-P2) 

^ = argmin -\\U S y - X s ^ - U s Xs^ (P1) ||£ 



n 



has only positive components. Lemma 2 then yields f3^ P2 ^ = (3^ P2 ^ = fts- Using 
the closed form expression for the ordinary least squares estimator, one obtains 



^ P2) = ^slxlllsy 

(3.8) = ^ s x](x s ft + n 5 e - n s x S c0 {pl) - P*s°)) 

= ft + l^Xje - £^£ 5S c(£( pi ) - p* c ). 

It remains to control the two terms As = —Tjg S X s ~E and T,g S T l ss c (^ P1 " > — /3gc). 
For the second term, we have 

\\Z s - s Z S s4P iP1) - ^)lloc < max \\^ s v\U\^0 {P1) ~ Ph)\\oo 

(3.9) M °° 1 

( < 4) K(S)(0 P1 ^\\ 1 + H^HO. 

Step 3: Putting together the pieces. The two random terms A and As are 
maxima of a finite collection of linear combinations of sub-Gaussian random 
variables so that (A. 2) in Appendix A can be applied by estimating Euclidean 
norms. For A, we use that ||Zj|| 2 = ||ngAj|| 2 < ||Aj|| 2 for all j. Second, we 
have 

\ V J £ \ _i 

(3.10) As = max. — — , vj = X s ^ S s e J^ j = 1 >---,s, 

i<j<s n 

where Cj denotes the j-th canonical basis vector. One has 

(3.4) „ 

II ||2 Tvi— 1 v ri— 1 ^ 11 
max \\Vj o = max e\ LwAi AsL^^e,- < 7—. 

i<j<s 3112 i<j<s 1 ss b ss J ~ min (S) 
It follows that the event 



A<J^ nL< 2a l2logP 



n 



V0min(^) V n 
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holds with probability no less than 1 — A/p. Conditional on that event, it follows 
that 

- r 2) ||oo < (b + W s 4i)K(S) + — ^J*^, 

with 6 as in Theorem 5, and hence, using the lower bound on /3 m i Q (S), that 
f3~( P2 ) =p s yO and thus also that (3^ = /3 S c. □ 



3.4. Uniqueness of solution. Uniqueness of the NNLS solution in the noisy 
case may follow in the form of corollaries to Theorems 3 and 5 when combined 
with the observation made after Lemma 1, namely that uniqueness is implied 
by y £ C. 

Corollary 1. Consider the setting of Theorem 3 with £* = 0. 

If ^\\X/3— X/3*\\ 2 — > in probability as n — > oo, then (3 is unique with probability 

tending to one. 

PROOF. Suppose first that y ^ C, then X/3, the projection of y on C, is con- 
tained in its boundary, i.e. in a lower-dimensional face. Using general position 
of the columns of X (1.3) , Proposition 1 implies that f3 is unique. If y were 
already contained in C, one would have y = X/3 and hence 
(3.11) 

-\\XP* - Xf3\\l = -\\XP* - y\\l = -\\e\\l = 0(1), with high probability, 
n n n 

using concentration of the norm of the sub-Gaussian random vector e. However, 
(3.11) contradicts the assumption that ^||X/3* — J^/3||| — > in probability. □ 

Note that f3 may be unique even though in the absence of noise the underdeter- 
mined linear system X/3 = X/3* subject to /3 ^ may have multiple solutions. 
By concentration of measure, the sub-Gaussian noise vector e is contained in 
a set S £ = {v G IR n : C\fn < \\v\\ 2 < C^/n} with high probability. If the inter- 
section (X/3* + S £ ) DC has small volume, y may in fact be no longer contained 
in C. 

Alternatively, a condition for uniqueness can be inferred from the decomposi- 
tion of the previous subsection. 

Corollary 2. In the setting of Theorem 5, if it holds that 



?<yc l /logp . logp 

H — n , „. — > U asn— > oo, 



t 2 (S) V n r 2 (S)n 
and that s < \n, then (3 is unique with probability tending to one. 



Proof. Observe that y G C implies that ^\\y — Xf3\\\ = and hence also 
that + Z/3* sc - Zfis4l = ( cf - ( 3 - 5 ))> which in turn implies that = 
^\\Z(/3g c — AsOlli- Moreover, ^\\y — X/3||| = implies that f3s c is a minimizer 
of the sub-problem (PI) of Lemma 2, and since f3* sc is a feasible solution of 
(PI), it holds that that 

^\\Z(/3* Sc - %c)\\l < \eZ(%, - f3* sc -) < ^||Z T e||oo(||/3^||i + \\Ps4i)- 
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Inserting the bound on ||/?,sc||i of Theorem 5 and using that i||Z T ^|| 00 = 
0(^log(p)/n) with high probability, the assumption of the corollary implies 
that ^\\Z{P* Sc - /3sc)\\l tends to zero, which contradicts = ^\\Z{f3* Sc - 

Ps c )\\% since ^ll^lli = ^ll n <Hli = ^(1) witn ^ig n probability, using that s < 
\n. □ 

By removing dependence on ||/3g||i, Corollary 2 may constitute a considerable 
improvement over Corollary 1. The latter follows from Theorem 3, where no 
further assumption about /3* or geometric aspects of C like the separating 
hyperplane constant t(S) are made. If t 2 (S) = f2(s -1 ) (the scaling of t 2 (S) will 
be discussed in the next subsection) and ||/?cjc||i = 0, the condition of Corollary 
2 becomes s\og{p)/n — > 0, which is typically regarded to be necessary to cope 
with high-dimensional linear models in the presence of noise, e.g. [47]. 

3.5. Scaling of t 2 {S). In this subsection, we investigate the scaling of t 2 (S), 
the quantity our analysis crucially depends on. It is shown that for what we 
term 'equi-correlation-like' designs, which comprises a broad class of random 
designs that will here serve as proxy for general non-negative designs, t 2 (S) 
scales favourably as il(s _1 ), minus a deviation term, which has moderate de- 
pendence on p. 

Equi-correlation-like designs. Suppose that X is such that £ = (1 — p)I + 
/9ll T , < p < 1. It is then easy to compute (cf. Appendix E) that for any 
Sc{l,...,p} 

(3.12) r 2 (S) = r 2 (s) = (1 ~ p)p + ±Z£ = O^ 1 ). 

(s — l)p + 1 p - s 

The fact that £ has exactly the form given above, requires that n = p, i.e. it is 
not a suitable model for the high-dimensional case where n < p. Therefore, we 
instead look more closely at the following class of random designs (previously, 
X has always been considered fixed) whose population Gram matrix S* = 
E [iX T X] has equi-correlation structure after proper re-scaling of the columns 
of X. More specifically, we consider the following ensemble of random matrices 
(3.13) 

Ens+ : X = ( 

Xij)i<i<m i-i-d. from a sub-Gaussian distribution on M_|_. 

i<i<p 

Among others, the class of sub-Gaussian distributions on R + encompasses all 
distributions on a bounded set on R + , e.g. the family of beta distributions 
(with the uniform distribution as special case) on [0,1], Bernoulli distributions 
on {0, 1} or more generally distributions on positive integers {0, 1, . . . , K}. For 
all random matrices belonging to the class (3.13), the corresponding popula- 
tion Gram matrix £* can be put into equi-correlation structure by re-scaling. 
Denoting the mean of the entries and their squares by fi and fi2, respectively, 
we have £* = (fi2 — ^ 2 )I + ^ 2 11 T such that re-scaling by l/y/Jpz leads to 
equi-correlation with p = p 2 /p2- 

t 2 (S) for equi-correlation-like designs. To obtain a lower bound on t 2 (S) 
for random designs from Ens+, one additionally has to take into account the 
deviation between S and £*. Using tools from non-asymptotic random matrix 
theory, we show that the deviation is moderate, of the order 0(y/log(p)/n). 



15 



Theorem 6. Consider the random matrix ensemble Ens+. Let the rows of 
the random matrix X be scaled in a fashion thatE[±X T X] = (l-p)I + pll T 
for some p G (0, 1). Fix an S C {1, . . . ,p}, \S\ < s. Then there exists constants 
c,ci,C2,cs,C,C' > such that for all n > C \og(p)s 2 , 

(3.14) r\S)>c l --C\l^ 

s V n 

with probability no less than 1— 3/p— exp(— c\n) — 2 exp(— C2 logp) — exp(— C3 log 1 ' 2 ^)^). 

Theorem 6 asserts that the deviation of t 2 (S) from its population counterpart 
is well-controlled provided n,p,s are related by n > C log(p)s 2 . While this is 
encouraging, because it indicates that t 2 (S) is stable under random pertur- 
bations of the Gram matrix, the result is still too pessimistic, as can be seen 
from numerical experiments whose results are reported below. In the numeri- 
cal study conducted, n = 500 is fixed and p € (1.2, 1.5, 2, 3, 5, 10) • n and s € 
(0.01, 0.025, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5) • n vary. For each combination 
of (p, s), as representative of Ens+, a random matrix of dimension n x p whose 
entries follow a uniform distribution on [0, 1], is drawn and re-scaled such that 
the population Gram matrix has equi-correlation structure with parameter p = 
3/4. We set S = {1, . . . , s} and compute Z = (I — Hs)Xsc using a QR decom- 
position of X$ and then solve the quadratic program min Aerp - s -i A T ^Z T ZX 
with value t 2 (S) by means of an interior point method [4]. For each combi- 
nation of (p, s), 100 replications are performed. Figure 3 displays the averages 
obtained for t 2 (S) over these 100 replications. It is revealed that n does not 
have to be significantly larger than s as suggested by Theorem 6. In fact, even 
for s/n as large as 0.3, t 2 (S) is sufficiently bounded away from zero as long as 
p is not dramatically larger than n (p/n = 10). For s/n > 0.3 while simultane- 
ously p/n > 5, t 2 (S) is numerically zero. Indeed, Figure 3 shows that there is a 
distinct region in the (n/p, s/n)-plane where one cannot hope for sparse recov- 
ery. This in accordance with the phase transition characterized by Donoho and 
Tanner [15-17, 19] regarding the proportion of faces of polytopes that continue 
to be faces after a random projection into a lower-dimensional space. While 
we have only considered a specific instance of the class Ens+, it is unlikely — 
not least because of the findings in [18] — that the results should depend on 
the specific probability distribution of the entries, provided the parameter p 
remains unchanged. 

3.6. Convergence in sup-norm and support recovery by thresholding. In this 
section, we argue that under appropriate conditions, the NNLS estimator may 
achieve a near-optimal rate of convergence in sup-norm, i.e. it may hold that 
1 1 P — P* 1 1 00 — 0{sJ\og{p) /n) with high probability, in which case the set S of 
large coefficients can be effectively recovered by hard thresholding. 

Sup-norm error of least squares and the smallest singular values of square ma- 
trices. Suppose that n = p and that the columns of X are linearly indepen- 
dent. Given observations y = X/3* + e as in Theorem 5, a bound on — /3*||oo 1 
where (3 denotes the ordinary least squares estimator, can be deduced analo- 
gously to the third step in the proof of Theorem 5. We have, by definition of 

ft 

(3.15) P-/3*||oo = ||E- 1 X T e/n|| 00 



16 



2 



4 



6 



8 



10 



p/n 



Fig 3. Empirical scalings of the quantity log 2 (r 2 (5')) for Ens+ in dependency of s/n and 
p/n, displayed in form of a contour plot. The lines indicates the level set for —15 (solid, 
2 -15 » 3 ■ 10" 5 ;, -10 (dashed, 2~ 1() w 0.001J and -5 (dotted, 2 -5 w 0.03;. 

When arguing as for (3.10), a lower bound on the square root of the smallest 
eigenvalue of S enters. However, if X is a square matrix, its smallest singular 
value is typically tiny so that a bound based on (3.15) is of little use. To get an 
idea of how that smallest singular value scales, one may look at certain classes 
of random matrices. Considering e.g. X whose rows are drawn i.i.d. from a sub- 
Gaussian distribution, recent results due to Rudelson and Vershynin [34, 35] 
imply that the smallest singular value scales as 0(n -1 / 2 ). 

Sparsity of the NNLS solution. Using the decomposition technique underlying 
Theorem 5, it is argued in the sequel that the NNLS solution may be sufficiently 
sparse, which allows one to replace dependence on the smallest singular value of 
X by dependence on the smallest singular value of a tall rectangular submatrix. 
In case that the number of columns of that sub-matrix is proportionally small 
to the number of its rows, it is common in the literature on sparse recovery 
to assume that the smallest singular value of that matrix is lower bounded 
by a universal constant. More precisely, our argument evolves in the setup of 
Theorem 5 according to the following steps. 

1. Lemma 1 states that the NNLS estimator can be obtained by least squares 
regression of y and a submatrix Xp, whose columns generate Cp, the 
face of C the observation vector y is projected on. We take F = S U G 
as candidate for F, where S is the set of large coefficients of f3* and G, 
\G\ < (n — s), is the support of f5^ P1 ^ as defined in Lemma 2. 

2. Provided T.ppXjy/n has only positive components, /3f = T. FF Xj,y /n, 
while Pf c = 0. 

3. If the condition in 2.) is fulfilled, we have according to (3.15) 



with probability at least 1 — 2/p as for (3.10), where (frmmiF) is the 
smallest eigenvalue of T,ff- 
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With the same reasoning in the proof of Theorem 5, for the condition in 2.) to 
be fulfilled, it has to hold that 



(3.16) /WS) > - 7 ^=J 2 -^. 

This may constitute a significant improvement over the corresponding condi- 

1 /2 

tion on (3 m ™ (S) in Theorem 5 if {(j>mm{F)} , the smallest singular value of 
Xp/y/n, is not too small. As explained above, this assumption is not overly 
restrictive as long as the set F has cardinality significantly smaller than n, 
which is fulfilled only if it holds for both S and G as defined in 1). While S 
is assumed to have small cardinality throughout, bounding \G\ constitutes a 
major obstacle. Yet, the hypothesis that \G\ is small, is perfectly plausible in 
view of the strong bound on ||/3s c ||i i n terms of t 2 {S). We conclude by stating 
the following result. 

Theorem 7. Let the data- generating model be as in Theorem 5 and set F = 
{j : Pj > 0} and b = 2a {0 min (F)}" 1/2 y/2log(p)/n. If Anin(5) >b, then 

S^F, \\Ps-P*s\\oo <b, Ifc -/^dloo <b, 6 = max{6, Halloo}. 
with probability no less than 1 — 2/p. 

A second set of conditions to bound \\/3 — /3*||oo- So far, our reasoning has 
mainly been based on t(S), which scales well for designs with a dense, positive 
correlation structure. On the other hand, t(S) decays rapidly in p for several 
designs of interest. Take orthonormal design, i.e. £ = I p as an example. It is 
not hard to verify that in this case t(S) = r(s) = l/\/p — s, yet it is well- 
known that sparse recovery by thresholding works well in this case [14]. For 
this reason, we introduce a second quantity that permits us to cover this and 
further designs. 

Definition 2. For some fixed S C {l,...,p} and Z = HgXsc, the quantity 

u(S) is defined as 

(3.17) 

u(S) = min min -ZlZ F v , V(F) = {v G rJ 1 : IblU = 1}. 
®^FC{l,...,p-s}veV(F) " oo + 



As compared to t(S), uj(S) is obviously less handy since in general it cannot be 
evaluated numerically once S is fixed, because u)(S) involves minimization over 
all square sub-matrices of ^Z T Z. Furthermore, while the geometric meaning is 
less transparent, it can be shown that u(S) quantifies separation of the cones 
Cs and Cs c as well. In Appendix F, the following properties are proved. 

(3.18) U){S) > O t(S) > O C s is a face of C, 

(3.19) w(5) < 1, with equality if {Xj} je s J- { x j}jas- and ^-X] c X S c b 

(3.20) u(S)> min -(Z T Z) ji + - V mm{(Z T Z) jk , 0}, 

i<i<(p-«) " n ^ 

The lower bound (3.20) is rather conservative. One can think of w(5) being 
lower bounded by a positive constant as long as there are no large principal 
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submatrices of ^Z T Z with significantly negative off-diagonal entries. 
For Theorem 8 below, we need in addition to the quantities in (3.4) the follow- 
ing constant, which is similar to the so-called cumulative local coherence used 
in [8] , which specifically addresses the case where for a small number of pairs 



(j,k), \a jk \ 
set 

(3.21) 



~X~J Xu 



is significant, while it is close to zero otherwise. We 



fJL+(S) = max ) \cr jk \. 



Theorem 8. Let the data- generating model be as in Theorem 5. Set 

2a 



+ 2cr 



b=(b+\\F s 4 1 )iM+(S)K(g) + 



21ogp 



If Pmm(S) > b, then the NNLS estimator (3 has the following properties with 
probability no less than 1 — 4/p 



J S c \\oo 



<b. 



PMoc < b. 



Proof. The proof parallels that of Theorem 5. Only the differences in reason- 
ing are given. 

Step 1: The constant uj(S) gives rise to a bound on the ^oo-norm of /3( pi ). In 
view of Lemma 1, there exists a set F C {1, . . . ,p — s} (we may assume F ^ 0, 
otherwise ^ P1 ) = 0) s.t. /3^ 1} = and s.t. 

-ZlZ F fP = -Zj(C + Zf3* sc ), 
n n 



which implies that 



tn 



< A 



Villi 



A 



mm 



izJZ F v _||^ P1 )|U< A + WPUh, V(F) = {veR^: 

P1) \\oo<A + \\fi* Sc \\i, 



1}, 



mm mm 

V>^FC{l,...,p-s}v£V(F) 



-ZpZpv 

n 



^o;(5)||^ pl )|U<A + ||/3|c||i, 

where we have used Definition 2. We conclude that ||/3 (pl) ||oc < {A+\\P* Sc \\i)/uj{S). 
In Step 2, the bound (3.9) is changed as follows. 



|£ s i£ss^ (P1) - P S o)\\oc (3 ' 4) <' 21) K(S) fi+(S) (WP^Woo + 



The remainder of the proof is along the lines of the proof of Theorem 5, mutatis 
mutandis. □ 



Example: power decay. Let the entries of the Gram matrix £ be given by 
°~jk — p^~ k \ 1 < j, k < p, < p < 1. In general, S can be identified with a co- 
variance matrix of a set of zero- mean, unit variance random variables {Rj}^ =1 . 
Correspondingly, for any 5 C {1, . . . ,p}, the matrix 

(3.22) 



1 Z T Z=-X] c (I-U s )X S c 



n 



n 
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can be interpreted as the conditional covariance matrix of the random vari- 
ables {Rj}jes c conditional on {Rj}j^s- The power decay structure of £ gives 
rise to a Markov random field in which the conditional covariances satisfy 
Cov(Rk, Ri\{Rj}j<zs) > 0, k, I ^ S, with equality if S contains an index m such 
that min{fc,^} < m < max.{k,l}, see e.g. [36]. Hence all entries of ^Z T Z are 
non-negative, such that, using (3.20), uj(S) > min 1 < J <( p _ s ) ^(Z T Z)jj. A lower 
bound on the minimum diagonal entry is obtained by noting that 
(3.23) 

-(Z T Z)jj = Vax(Rj\{R k } k€S ) > Vax(i2j|{i2j_i, Rj+i}) = ajj-Ejj^iE^y 1 ^^, 

with M = {j — 1, j + 1} (we may exclude the cases j = 1 and j = p, since the 
minimum is not attained there) and 

ZjJV =[p p], Satat = 

Explicit computation of the r.h.s. of (3.23) then yields that ^(Z T Z)jj > 

1 — j^t- Moreover, it is well-known [36] that the negative off-diagonal en- 
tries of the inverse of a covariance matrix are — up to a positive multiplicative 
constant — equal to the conditional covariances after conditioning on all re- 
maining variables, i.e. for j / k, (X -1 )^ oc — Cov(Rj, Rk\{Rl}iet{j,k})- I n view 
of the specific Markov random field structure, Cov(Rj,Rk\{Ri}i^.{j i k}) = 
whenever \j — k\ > 1. It thus follows that, independent of S, has at most 
two non-zero off-diagonal entries per row. Hence, K(S) and 4> m m(S) as defined 
in (3.4) are necessarily upper and lower bounded by constants depending on 
p only, but not on s. By the geometric series formula, fJ,+ (S) < j^-. In total, 
by invoking Theorem 8 with ||/3<ye||i = 0, one obtains an ^-error of the right 
order 

0-(3*\\oc<C p ay/\og(p)/n, 
for a constant C p > depending on p only. 

Support recovery by thresholding. It follows immediately from Theorems 7 and 
8 that the set S of large coefficients can be recovered by hard thresholding. Let 
t > be a threshold. Then the hard-thresholded NNLS estimator is defined by 

(3.24) m = P v ^ >t: 

10, otherwise, j = 1, . . . ,p. 

Let us further define S(t) = {j : f3j > 0}. In the setting of Theorems 7 or 8, if 
t > b and if /3 m in{S) > b + 6, one has that S = S(t) with the stated probabili- 
ties. Choosing the threshold t according to the bounds of Theorems 7 and 8 is 
impractical, since it involves dependence on constants that are not accessible. 
In the following, we suggest to use a simple data-driven procedure proposed 
in [22] to be used in conjunction with marginal regression, which avoids direct 
specification of t. Instead, one attempts to estimate s. To simplify the exposi- 
tion, we shall assume for the remainder of the paragraph that H/Sgclli = 0. 
A crucial observation in [22] is that if there are statistics giving rise to a rank- 
ing (rj)? =1 of the predictors {Xj} v - =l so that rj < s for all j £ S, then it is 
sufficient for support recovery to estimate the support size s and select all pre- 
dictors of rank no larger than s. It is a consequence of Theorems 7 and 8 that 
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if Pmm(S) is large enough, the NNLS estimator f3 gives rise to such a ranking 
by setting 

(3.25) r j = k & #7 =£(*), j = l,...,p, %)>%>•••> Pip)- 

In [22], where Gaussian noise is assumed, the following estimate s" for s is 
suggested. In the sequel, denote by ft 2 = E[e 2 ] the variance of e. 

(3.26) s = max jo < k < (p - 1) : > iV 2 lo g™} + 1> 

(3.27) *(&) = ||(n(fc + 1) - n(A;))y|| 2 , k = 0, . . . , (p - 1), 

where n(/c) denotes the orthogonal projection on the subspace spanned by the 
variables with ranks no larger than k and 11(0) = 0. Note that the estimate 
s" in (3.26) depends on the noise variance i? 2 , which is unknown in general. 
However, one may replace it by a simple plug- in estimate resulting from the 
NNLS fit, provided its ^-prediction error -\\Xf3* — Xj3\\ 2 is small enough. More 
specifically, we have the following theorem that hinges on a result in [22]. 

Theorem 9. Let the data- generating model be as in Theorem 5 with \\/3%c\\i = 
0. Let b, b be as in Theorem 7 or Theorem 8. If fl mm (S) > b + b and 

?Vlog n 

tt— — . ^ — ,, > U as n — > oo, 

min ie5 ||(n s -n sv )^;il2 

and if there exists c, C > such that, with probability tending to one as n — ?■ 

oo, 

(3.28) ci? < d < C-d, where & = \\y - Xp\\%/n, 

then, S = S with probability tending to one, where S = {j : r,j < s} ; with 
{rj}P =1 as in (3.25) and s is the estimate of s according to (3.26), with i? 
replaced by §. 



We point out again that the strategy for (implicitly) selecting the threshold 
according to Theorem 9 does not require the specification of any parameter by 
the user. The output S results from a single NNLS fit plus repeated evaluation 
of (3.27), with d 2 replaced by the plug-in estimator (3.28), until the stopping 
criterion in (3.26) is reached, which can be done efficiently by updating QR 
decompositions . 

Finally, we point out that subsequent to thresholding, it is beneficial to re- 
compute the NNLS solution using data (y, X§) only, because the removal of 
superfluous variables leads to a more accurate estimation of the support coef- 
ficients. 



3.7. Comparison of NNLS and the non-negative lasso. In the literature on 
high-dimensional statistical models and sparse recovery, £i-regularization has 
received most attention, see the retrospective [42], and is often propagated 
as the method of choice due to ease of computation and strong theoretical 
guarantees. Similar guarantees for certain designs that are, roughly speaking, 
characterized by a positive correlation structure, have been established for 
NNLS in the preceding sections. In the present subsection, we enumerate sev- 
eral advantages of NNLS over ^i-regularization, some of which are confirmed 
by numerical studies discussed in Section 4. 



21 



The non-negative lasso. Consider the non-negative lasso problem 

(3.29) mini||y-X/3||^ + Al T /3, A > 0, 

/3^o n 

with minimizer denoted by (3 ei ^. In the remainder of the paragraph, we study 
Z?^ 1 from the point of view of variable selection under a non- negative version 
of the 'irrepresentable condition' which has been shown to be a sufficient and 
essentially necessary condition for support recovery of the lasso [32, 47, 53]. 

Definition 3. Let S C {1, . . . ,p} be given. The non-negative irrepresentable 
constant is defined as the smallest value j(S) € [0, 1] such that 

(3.30) Xj c X 5 (XjX s )- 1 l ^ (1 - 7 (S))1. 



Similarly to the constant t(S) used in our analysis of NNLS, the irrepresentable 
constant also quantifies separation between support and off-support, with the 
important difference that the separating functional is affine instead of linear. 
In fact, we may equivalently express (3.30) as 

X]cW + biyj(S)l, X]w + bl = 0, w = -X s (XjX s )~ 1 l, 6 = 1. 

By straightforward modifications of techniques employed in [47], a positive 
lower bound on 7(6*) gives rise to the following properties of /3 x >^ as sketched 
in Appendix H. 



Theorem 10. Consider the data- generating model y = X(3* +e, where e has 
i.i.d. zero-mean sub-Gaussian entries with parameter a, (3* s ^ /3 m m(S) > and 
P* sc = 0. // 
(3-31) 

A > 7-7777, A = 2a \J — and f3 min (S) >b, b = A||S^1|| 00 +- 



it holds that \\f3 s '~ — Ps\\°° ^ b and that f3 s ^- = with probability at least 
1 - 2/p. 



It further follows from the analysis in [47] that a positive lower bound on 'y(S) 
is also a necessary condition for support recovery. Moreover, there is a striking 
resemblance of the result (3.31) and that of Theorem 5, with t 2 (S) playing a 
similar role as 'y(S) and structurally roughly the same condition on (3 m m(S), 
noting that HE^lHoo is related to the constant K(S) defined in (3.4). 
From (3.31) one cannot deduce that the non-negative lasso attains the optimal 
^oo-rate for estimating /?*, because j(S) may be decreasing in s and HS^lHoo 
may be of order 0(y/s). The assumption that j(S) is uniformly bounded away 
from zero is considered to be rather restrictive [33]. On the other hand, under 
a restricted eigenvalue condition [3], which is closely related to Condition 2 
of the present paper and which applies much more broadly than the 'strong 
irrepresentable condition' of [53], — /3*||2 is of the order 0(y/slog(p)/n) 

[3], which is in general not improvable apart from the log factor. However, it 
is no longer guaranteed that (3gc~ = 0, so that subsequent hard thresholding 
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of ffi 1 '^ has been proposed as remedy [33, 54]. The ^2-bound requires a sub- 
optimal condition on /3min (S) for thresholding to identify the support S. Opti- 
mal rates in sup-norm of the order 0(y / log(p)/n) are derived in [9, 30] under a 
'mutual incoherence' condition which, however, in general requires the recov- 
erable sparsity level s to be of the order o(y/n). The fact that £i -regularization 
behaves suboptimally with regard to its -error, which is eventually decisive 
when one is interested in accurate feature selection, is stressed in [52]. In the 
cited references, non-negativity constraints do not appear; however, apart from 
possible changes in constants, we do not see how these constraints could lead 
to a substantially different theory. 

NNLS vs. the non-negative lasso: pros and cons. To sum up, we list advan- 
tages and disadvantages of NNLS and the non-negative lasso, thereby providing 
some guidance on when to use which approach in practice. While NNLS can 
formally be seen as a special case of the non-negative lasso as A — > 0, we always 
think of the latter as applied with a significant amount of regularization (i.e. it 
is implicitly assumed that A > Ao, with Ao as in (3.31)). 

• A drawback of NNLS is that its success for high-dimensional linear models 
is coupled with conditions on the design, the half-space condition (H) 
being a minimum requirement. If these conditions are not met, NNLS is 
hopelessly prone to overfitting. A noted advantage of the non-negative 
lasso is that it performs well for a broader class of designs. 

• There are designs for which NNLS achieves a better performance for 
estimating (3* in sup-norm and hence yields better performance with 
regard to feature selection via thresholding. On the other hand, as far as 
the ^2-error for (3* or the prediction error is concerned, one cannot expect 
a pronounced difference of the two approaches. 

• Prom the point of view of a practitioner, NNLS possesses an important 
advantage over the non-negative lasso: it can be applied directly, without 
the need to specify a regularization parameter. As asserted by Theo- 
rem 9, the threshold can be chosen in an entirely data-driven, compu- 
tationally inexpensive manner. This strongly contrasts with the use of 
regularization-based methods which usually involves cross-validation in 
conjunction with a grid search for the regularization parameter (s). 

4. Illustrations. We conclude by presenting experimental results illustrat- 
ing important aspects discussed in the paper. 

4.1. Deconvolution of spike trains. We consider a positive spike-deconvolution 
model as in [28], which is of importance in various fields of applications. It is 
assumed that the underlying signal /, which is a function on [a, b), is of the 
form 

s 

f( u ) = ^2Pk<f>k{u), ue[a,b], 
fc=l 

with 4>k{') = <K" — A*fe)) k = 1, . . . ,s, where 4> > is known as point-spread 
function (PSF) and the ti^'s define the locations of the spikes contained in 
[a,b]. The amplitudes {/3^}j =1 are assumed to be positive. Assuming further 
that the PSF is known, the problem consists of determining the positions as 
well as the amplitudes of the spikes from n (potentially noisy) samples of the 
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underlying signal /. As demonstrated below, NNLS can be a first step towards 
deconvolution. The idea is to construct a design matrix of the form X = 
(4>j(ui)), where tfij = 4>{- —rrij) for candidate positions {rn,j}^ =1 placed densely 
in [a, b] and {ui}f =1 C [a, b] are the points at which the signal is sampled. Under 
an additive noise model with zero-mean sub-Gaussian noise e, i.e. 

s 

Vi = ^2 Pk<j>k(ui) +£i, i = l,...,n, 
fe=i 

and if X has the self-regularizing property (2.5), cf. Section 2.4, it follows 
immediately from Theorem 3 that the ^-prediction error of NNLS is nicely 
bounded in the sense that 

(4.1) -||/ - XP\\* <£* + cJ^, {ft = f{ Ui )}U, 

n V n 

holds with high probability, where £* = min^o ~||/ — XP\\2- Note that it is 
not realistic to assume that {p>k} P k=i ^ { m j}j=i> l - e - that the linear model is 
correctly specified, albeit we may think of £* being negligible as long as the 
{ m j} P =i are placed densely enough. This means that NNLS may be suitable for 
de- noising. Furthermore, the bound (4.1) implies that j3 must have large com- 
ponents for those columns of X corresponding to locations near the locations 
{fJ>k}k=i of the spikes, which can then be estimated accurately by applying a 
simple form of post-processing as discussed in [39] . 

Figure 1 displays the outcome of an example, in which the design matrix is 
specified correctly so that £* equals zero. The underlying signal is composed 
of five spikes of amplitudes between 0.2 and 0.7 convolved with a Gaussian 
function. The design matrix X = (<j)j(ui)) contains evaluations of p = 200 
Gaussians {4>j} P =1 at n = 100 points {tij}™ =1 , where both the centers of the 
{(j>j} as well as the {uj}" =1 are equi-spaced in the unit interval. The standard 
deviation of the Gaussians is chosen such that it is roughly twice the spacing 
of the {ui}. At this point, it is important to note that the larger the standard 
deviations of the Gaussians, the larger becomes the constant To, cf. (2.5). Ad- 
ditive Gaussian noise with standard deviation 0.09 is used. 
From the left panel of Figure 4, we conclude that over-fitting scarcely occurs, 
the only visible exception being the region between 0.5 and 0.6. The right panel 
displays the coefficient vector j3, which is remarkably sparse and concentrated 
near (in some cases precisely at) the positions of the underlying spikes. 

4.2. Sparse recovery: comparison of the non-negative lasso and NNLS. The 
present subsection illustrates the excellent performance of NNLS with respect 
to sparse recovery by thresholding. 

Setup. We randomly generate data y = X/3* + e, where e has i.i.d. standard 
Gaussian entries. We consider two choices for the design X. For one set of ex- 
periments, the rows of X are drawn i.i.d. from a Gaussian distribution whose 
covariance matrix has the power decay structure of the example following The- 
orem 8 with parameter p = 0.7. For the second set, we pick a representative 
of the class Ens+ (3.13), drawing each entry of X uniformly from [0,1] and 
re-scaling such that the population Gram matrix has equi-correlation struc- 
ture with p = 3/4, precisely as for the numerical results regarding the scaling 
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Fig 4. Left panel: Data as described in the text. The grey area represents X/3* , a linear 
combination of five Gaussians. The dashed line represents the NNLS fit X/3. Right panel: 
True coefficients (top) and NNLS coefficients {/3j}^ =1 (bottom). 



of t 2 (S) in Section 3.5. The target f3* is generated by selecting its support 
S uniformly at random ((3g c = 0) and then setting (3* = b ■ /3 m i n (5)(l + Uj), 
j G S, with /9 m i n (5) = Cp-y/21og(p)/n, where C p is obtained by evaluating the 
constants appearing in the condition on (3 m i n (S) in Theorems 7 and 8 based 
on the underlying population Gram matrices; the \Uj}je.s are drawn i.i.d. uni- 
formly from [0,1], and b is a parameter controlling the signal strength. The 
experiments can be divided into two parts. In the first part, the parameter b is 
kept fixed while the aspect ratio p/n of X and the fraction of sparsity s/n vary. 
In the second part, s/n is fixed to 0.2, while p/n and b vary. When not fixed, 
s/n G {0.05,0.1,0.15,0.2,0.25,0.3}. The grid used for b is chosen specific to 
the designs, calibrated such that the sparse recovery problems are sufficiently 
challenging. For the design from Ens+, p/n G {2,3,5,10}, whereas for power 
decay p/n G {1.5,2,2.5,3,3.5,4}, for reasons that will become clear from the 
results. Each configuration is replicated 100 times for n = 500. 

Comparison. Across these runs, we compare the probability of 'success' of 
thresholded NNLS (tNNLS), non-negative lasso (NN4) as defined in (3.29), 
thresholded non-negative lasso (tNN^i) and orthogonal matching pursuit (OMP, 
[43, 51]). We also compare against the ordinary lasso (2.9); since its perfor- 
mance is mostly nearly equal, partially considerably worse than that of its non- 
negative counterpart (see the right panel of Figure 5 for an example) , the results 
are not shown in the remaining plots for the sake of better readability. 'Success' 
is defined as follows. For tNNLS, we have 'success' if min f3j > maxjggc /3j, 
i.e. there exists a threshold that permits support recovery. For NN^i, we set 
A = 2||X T e/n|| 00 , which is the empirical counterpart to Ao = 2y / 21og(p)/n, the 
choice for the regularization parameter advocated in [45] to achieve the optimal 
rate for the estimation of f3* in the ^2-norm, and compute the whole set of so- 
lutions {/3(A), A > A} using the non-negative lasso modification of LARS [21] 
and check whether the sparsity pattern of one of these solutions recovers S. For 
tNN-^i, we implement a strategy that has essentially been proposed in [33] as 
a remedy for the tendency of the lasso to over-select, assigning small non-zero 



25 



coefficients to elements from S c . On the other hand, the lasso yields an esti- 
mate close to j3* in the ^2-norrn, such that subsequent thresholding may be an 
effective strategy to get rid of false positive selections. In our experiments, we 
inspect the set of non-negative lasso solutions {/3(A) : A € [Ao A A, Ao V A]} and 
check whether mhijgs A' (A) > max^g^c /3, (A) holds for one of these solutions; 
with the range chosen for A, we aim at having solutions that approximate f3* 
well in £2- norm - We believe that this procedure is more reliable than fixing A 
to one specific value known to give the optimal rate in theory. Note that, when 
comparing tNNLS and tNNfi, the lasso is given an advantage, since we opti- 
mize over a range of solutions. Lastly, for OMP, we check whether the support 
S is recovered in the first s steps. 

Results. The approaches NNIi and OMP are not competitive — both work 
only with rather moderate levels of sparsity, with a breakdown at s/n = 0.15 
for power decay as displayed in the left panel of Figure 5. For equi-correlation- 
like design, the results are even worse. This is in accordance with the literature 
where thresholding is proposed as remedy [33, 52, 54]. Yet, for a wide range of 
configurations, tNNLS visibly outperforms tNN^i, a notable exception being 
power decay with larger values for p/n (see Figure 6). This is in contrast to 
the design from Ens+, where even p/n = 10 can be handled. In fact, power 
decay is noticeably different from equi-correlation-like design with regard to 
the scaling of the constant t 2 (S). One can show that for the population Gram 
matrix, it holds that t 2 (S) < 2(1 — p)~ 1 (p — s)" 1 = 0((p — s)^ 1 ) uniformly 
in S. As a result, it is not helpful to apply Theorem 5. On the other hand, 
Theorem 8 involves the constant u>(S) whose scaling for random designs is not 
clear. This issue requires future research, which could considerably improve the 
understanding of NNLS in a high-dimensional setting. 
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Fig 5. Left: Probability of success of the non-negative lasso without thresholding (blue) 
and orthogonal matching pursuit (magenta). Right: Probability of success of the thresholded 
non-negative lasso (blue) and thresholded ordinary lasso (green). 



Data-driven selection of the threshold. In the experiments of the previous 
paragraph, an explicit choice of the threshold has been avoided. Instead, we 
have confined us to examine whether there is a threshold permitting support 
recovery, given full information about the underlying S, which is not available 
in practice. On the other hand, we have outlined a scheme for an implicit data- 
driven choice of the threshold, which is consistent as asserted by Theorem 9. 
We here report results of an experiment that essentially confirms the finding of 
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FlG 6. Comparison of thresholded NNLS (red) and thresholded non-negative lasso (blue). 
Top: Probability of success for the experiments with constant b, while s/n (abscissa) andp/n 
(symbols) vary. Bottom: Probability of success for the experiments with constant s/n, while b 
(abscissa) andp/n (symbols) vary. 



Theorem 9. The experimental setup follows that of the previous paragraph. We 
take random design with i.i.d. uniformly distributed entries. Taking n = 500, 
The range for p/n remains unchanged, while s/n G {0.01,0.025,0.05,0.1,0.2}, 
excluding critically high values such as 0.25 and 0.3; the generation of the non- 
zero entries of (3* remains unchanged as well (the signal strength parameter b is 
kept fixed). The average number of false positive selections over 100 replications 
in dependency of s/n and p/n is displayed in Figure 7. The results are rather 

5 
4.5 
4 

CO 

I 3.5 
o 
_a> 

a) 3 
</) 

<D 

> 2.5 

CO 

a 2 

CD 
CO 

3 1-5 
1 

0.5 

°0 0.05 0.1 0.15 0.2 

s/n 

Fig 7. Average number of false positive selections when using the estimator of s of Theorem 
9. Note the extremely low percentage of false positive selections when compared to s or p 
(n = 500, i.e. p £ {750, 1000, 1500, 2500, 5000} ). 
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promising: false positive selections occur rarely, the absolute number ranging 
from less than one in the majority to a maximum of roughly five in the most 
difficult setting. Note that even in this extreme case, five false positives are 
only one percent of p = 5000 and only five percent of s = 100. Moreover, 
the situation is even better as far as false negatives are concerned. Only for 
s/n = 0.2 and p/n > 5, false negatives occur at all (on average 0.02 and 2.25, 
respectively). 
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APPENDIX A: SUB-GAUSSIAN RANDOM VARIABLES 



A random variable Z is called sub-Gaussian if there exists a positive con- 
stant K such that EflZl 9 ] 1 / 9 < K^Jq. The smallest such K is called the sub- 
Gaussian norm ||^||^, 2 of Z. If E[Z] = 0, which shall be assumed for the re- 
mainder of this paragraph, then the moment-generating function of Z satisfies 
E[exp(tZ)] < exp(— t 2 /{2a 2 )) for a parameter a > which is related to ||^||^ 2 
by a multiplicative constant, cf. [46]. It follows that if Z%, . . . , Z n are i.i.d. copies 
of Z and v € M n , then Y27=i Vi ^ i * s sub-Gaussian with parameter [[fH^u 2 . We 
have the well-known tail bound 

(A.l) p(| Z |>^)<2expf-|^Y z>0. 

Combining the previous two facts and using a union bound, with Z = [Z\ , . . . , Z n ) T , 

it follows that for any collection of vectors Vj € M n , j = 1, . . . ,p, 

(A.2) 

P ( max \vj Z| > a max ||u,-|L \j 2 log p + az | < 2exp ( — z 2 | , z > 0. 
\i<j<P 3 i<j<P J 2 J V 2 / 

APPENDIX B: PROOF OF THEOREM 2 

The proof relies on the following three auxiliary results. 



Concentration of measure of Lipschitz functions of Gaussian random 
vectors. 

Theorem B. 1. (e.g. [46], Proposition 34) 

Let f be a real-valued Lipschitz function on W l with Lipschitz constant L. Let 
g be a standard Gaussian random vector in M. n . Then for every z > 0, it holds 
that 

P(f(g)-E[f(g)] >z)< exp(-z 2 /(2L 2 )), P(E[f(g)]-f(g) > z) < exp(-z 2 /(2L 2 )). 

Concentration of extreme singular values of Gaussian random matri- 
ces. Denote by s m { n (X) and s max (X) the minimum and maximum singular 
value of a matrix X. 

Theorem B. 2. (e.g. [46], Corollary 35) 

Let X be an n x s matrix whose entries are independent standard normal 
variables. Then for every z > 0, each of the following two events hold with 
probability at least 1 — exp(— z 2 /2): 

(X) > y/n- y/s - Z, •'max 

(X) < Vn + + z. 

A property of the convex hull of Gaussian vectors. The following result 
can be a seen as a non-symmetric analog of a result of [24]. 

Theorem B. 3. Let X\, . . . ,X P be random n-dimensional standard Gaussian 
vectors. Denote their convex hull by 1C. Then there exists constants 
C\ > 2, C2, c > such that for all p/n > C\, it holds that 

KZ)C 2 minjv^, y/log(p /n)}B%, B% = {x G R n : \\x\\ 2 < 1}. 

with probability at least 1 — 2exp(— cn). 
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This result can be proved by straightforward modifications of the proof of a 
closely related result in [23]. In [23], the only difference is that {X\, . . . ,X P } 
are drawn according to the uniform distribution on a Euclidean ball. 

Proof of Theorem 2. Let 5 £ (0, 1). The event 



Ex = {m n (l - <5)^/ra < ||e|| 2 < (1 + 5)^}, m n = \Jn/(n + 1). 

holds with probability at least 1 — 2 exp(— 5 2 n/2). This follows from Theorem 
B.l, noting that n/\/n + 1 < E[||e|| 2 ] < \/n for all n > 1. Conditional on Ei, 
for any j3 G W such that ||X/3|| 2 < m n (l — 5)- v /n/2, the triangle inequality 
implies that 

\\e - X(3\\ 2 > m n (l - 6)y/E/2 => i|| £ - X/3||| > m^l - 5) 2 /4. 

In the sequel, it will be shown that one can find a numerical constant Co > 2 
such that if p/n > Co, it holds that 

(B.l) mmi|| e -X^||2<m2(l- ,5) 2 /4, 

with high probability, implying that ^H-X'/SlH > 7T1 n(l — <5) 2 /4 = H(l), as to be 
shown. In order to show (B.l), the NNLS objective is decomposed as follows. 

(b.2) —lie - xMi = -(«- v T pf + lie - uMi) , 

where we have decomposed e as e = ( ^ J , with £ denoting the first entry of e. 

Let tC denote the convex hull generated by the columns of U and let v = n — 1. 
By Theorem B.3, for pjv > C\ and for some constant C 2 > 0, the event 



E 2 = {K, D C 2 r l/ipJ B 2 }, r„ )P = min{^, vdog(p/V)} 

holds with probability at least 1 — 2exp(— cv). Hence, conditional on E\ and 
E2, there exists j3 E having no more than v non-zero entries so that 

(B.3) Z = UP and l T /3 < (1 + 6)y/n r^C^ 1 . 

Moreover, it will be shown that there exists C3 > so that \\/3\\2 < C3. Let U' 
of dimension v x v be the sub-matrix of £7 that is obtained by extracting the 
columns of U corresponding to the non-zero entries of (3; accordingly, define (3' 
as the corresponding sub- vector of /?. Let further J = { j : f3'- > ^=-}, where 

C4 > is chosen such that 1 T (3 < C^^fv. Note that |J| < u/4. We have that 
(B.4) 

Uh = WPh = \\U'(3'\\ 2 > \\U'jP'j\\ 2 -\\U' Jc (3'j4 2 > s m UU'j)\\P'j\\2-s max (U')\\P'j4 2 . 
Conditional on E 2 , the event 

holds with probability at least 1 — 2exp(— z^/32) by twofold application of 
Theorem B.2 with z = Further note that by construction, ||/3j c ||2 < 4Ci. 
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Consequently, if ||/3j||2 were growing with n, the left hand side of (B.4) were of 
larger order than 0{-\fn), contradicting ||£||2 < (1 + 5)y/n conditional on E\. 
We conclude that conditional on E\, E-i and E3, \\PW2 < C3, C3 > 0. 
Equipped with these intermediate results, we are in position to upper bound 
(B.2). Since (5 is feasible for the NNLS problem, we have 

-We-XM<- ((C - v T (3f + ||e - UP\\i) < -C 2 + V#> 2 - 
n n \ J n n 

Since the entries of v are i.i.d. zero-truncated Gaussian random variables having 
mean y^§, we may decompose v = y^fl+w, where oj = v — ^J^l has i.i.d. zero- 
mean sub-Gaussian entries. Conditional on E\ to E3, the tail bound (A.l) yields 
that 



P(E 4 ) < exp(-c / nr-2), £ 4 = j v J ' f3 > P) + (1 + <5)v^ C 3 C 



2 r u,p ( ) 



for some constant c' > 0, cf. (B.3). Recall that according to (B.l), it suffices to 
show that 

-\\t\\ 2 2 + -(v T p) 2 <mi(l-5) 2 /A. 
n n 

The first summand on the l.h.s. of the inequality is of lower order and can 
hence be absorbed into the second term in form of a multiplicative constant /i n 
slightly larger than one, so that we need ^ L (v T j3) 2 < (1— <5) 2 /4, or equivalently, 

Conditional on E\ to E4, inserting the bound on l T /3 in (B.3) shows that we 
need 



which can be achieved by taking p/n > Cq for a sufficiently large constant Co- 
Combining this with the probability estimates for the events E\, . . . , E4, the 
proof is complete. 

APPENDIX C: PROOF OF THEOREM 3 

Since (5 is a minimizer of the NNLS problem (1.2) and since (3* is a feasible 
solution, we have that 

hv-xM<-\\v-xm 

n n 



(C.l) 



||(/ + e - X(3*) + X(3* - xp\\l <~\\f + £-XP* II2 

n 

-\\X/3* - X(3\\ 2 2 + -(/ + £- X/3*) T X(/T -(3) < 
n n 

~^\\XP*-XP\\1 < l(f-X(3*) T X(P-f3*) + l £ T X(P-(3*). 
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Write 5 = (3 - 13*, P = {j : 8j > 0} and N = {j : Sj < 0}. We now lower 
bound ±\\X6\\l = ^S? using the self-regularizing property (2.5). 



(C.2) 



— H-X^Ho — ^p'SppSp + 26pYlp]\i5j\i + '5IjY1nn5n 
n 

>r 2 (l T ?p) 2 -2||? P || 1 ||^|| 1 . 



Second, we bound the r.h.s. of (C.l). We set A = maxi<j< p 
the bound 



max 

i<j<P 



-xj(f-xp* 

71 J 



< max^=\\Xi\L\ -\\f-Xp*\ 



and use 



'8\ 



obtaining that 

(C.3) -\\X8\$<2(A + y/8* 

n 

Inserting the lower bound (C.2) into (C.3), we obtain 

(C.4) r 2 ||? P || 2 - 2||?p|| 1 ||? jv ||i < 2{A + V^)(||?p||i + \\8nWx). 

We may assume that 5p ^ 0, otherwise the assertion of the theorem would 
follow immediately, because ||<5tv||i 1S already bounded for feasibility reasons, 
see below. Dividing both sides by ||<5p||i and re-arranging yields 



(C.5) 



¥p\\i < 



4(A + V^) + 2\\5 N \\ 1 



where we have assumed that ||£jv||i — II^pIIi (if that were not the case, one 
would obtain ||£p||i < ||<5/v||i> which is stronger than (C.5), since < Tq < 1). 
We now substitute (C.5) back into (C.l) and add 8* = ±\\X/3* - /||| to both 
sides of the inequality in order to obtain 

8 = -\\XP- <£* + 2A{\\5 P \\ 1 + ||?Ar||i) 



n 



< 8* + 



6^||/3*||i + 8(A 2 + AV8* 



noting that by feasibility of /3, one has 5 ^ —[3* and hence ||<5jy||i < II ^Ill- 
Using the maximal inequality (A. 2) for a finite collection of sub-Gaussian ran- 



dom variables, the event < A < 2a 



21ogP 



holds with probability no less than 



1 — 2/ p. The result follows. 



APPENDIX D: PROOF OF THEOREM 4 

Arguing similarly as in the proof of Theorem 3, it will be shown that 

Ps c ||i < ^ll^slli) i- e - 8 £ 7£(3Ao , S) as defined in Condition 2; in the sequel, 
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all notation introduced in the previous proof are adopted. Note that S C CP 
and N <Z S. Hence, we obtain the following analog to (C.4). 

r 2 ||fec||f-2||^||i||fe||i<2^(||?s||i + ||fec||i). 

Dividing both sides by H^Hi, assuming that < \\6s\\i < p5 c ||i (otherwise, 
the claim 5 G H(3/tq, S) would follow trivially), we obtain 



roll^lli <4A + 2\\S S \\ 1 < 3||o 5 |U, 

assuming that 4A < \\8s\\i, i.e. 5 € 7Z(3/tq,S). Suppose conversely that 
\\8s\\i < 4A One then has 

fo||i<^, -\\X5\\1<2a\^ + 4a), 

so that the assertion of Theorem 4 follows. Conditional on 5 £ 1Z(3/tq, S), one 
may invoke the compatibility condition (2.10), which, when applied to (C.l), 
yields 

(^,s) \\5 s \\l < -\\XS\\l < 2A(\\S s \\i + \\5s4i) < ^lltelli 
which implies that 

fen. < - s f° \ . ii?-riii = ii*iii< 



We conclude by applying (A. 2) as in Theorem 1. 

APPENDIX E: PROOF OF THEOREM 6 

E.l. r 2 (S) for equi-correlation. We first prove (3.12) for the population 
Gram matrix £* = E[±X T X] = pi + (1 - p)ll T , p £ (0,1). We compute 
t 2 (S) = r 2 (s) from the matrix -Z T Z via (3.3). One computes that 

(VU (V- 1 ) 1 \l + {s-2)p j = k, 

and consequently, using (3.22), 

(E.2) (lz-z) = {WV(1 + (S-1)P) ," = *, 

In view of the simple structure (E.2), one verifies that the minimum in (3.3) is 
attained for A = l/(p — s), which yields that 

2 (l-p)p 1-p 

[) (s-l)p + l + p-s> 

as was to be shown. 

We state and prove three lemmas first, which rely on the following two theorems 
that can be found in [46]. 
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E.2. Bernstein-type inequality for squared sub-Gaussian random vari- 
ables. The following exponential inequality combines Lemma 14, Proposition 
16 and Remark 18 in [46]. 



Theorem D. 1. Let Z\, . . . , Z n be i.i.d. centered sub- Gaussian random vari- 
ables with sub-Gaussian norm K . Then for every a = (a±, . . . , a n ) T 6 W 1 and 
every z > 0, one has 
(E.3) 

\ ( ( z 2 \\ 

> z I < 2exp — cmm 



5>(^-E[Zf]) 



i=i 



K±\\a\\l'K*\V 



where c > is an absolute constant. 



E.3. Concentration of extreme singular values of sub-Gaussian ran- 
dom matrices. Recall that s m ; n (X) and s max (X) denote the minimum and 
maximum singular value of a matrix X. The following statement, which is a 
generalization of Theorem 2, is a special case covered by Theorem 39 in [46]. 

Theorem D. 2. Let X be an n x s matrix with i.i.d. centered sub- Gaussian 
entries having unit variance and sub-Gaussian norm K. Then for every z > 0, 
with probability at least 1 — 2exp(— cz 2 ), one has 

(E.4) \fn — Ca/s — z < s m - m (X) < s max (X) < \fn + C^fs + z, and 

(E.5) s max f -X T X -l)< m&x(5,5 2 ), where 5 = CJ- + ^=, 

\n J V n V n 

with C , c depending only on K . 



E.4. Additional lemmas. 



Lemma D. 1. Let Z±, . . . , Z n be i.i.d. centered, unit variance sub- Gaussian 
random variables with sub-Gaussian norm K. Then for all z > 



(E.6) P ( ^ Z i > n + zn ) ^ ex P(- cmin ("^4 ; -^) n 



vi=l 



K V K 2- 



PROOF. Noting that E[^" =1 Zf] = n and re-arranging, the result follows from 
Theorem D.l with a = (1, 1) T . □ 

In the sequel, we denote by X* the population covariance Ei[^X T X] = (1 — 
p)L p +pll T , where p G (0, 1) depends on the specific distribution for the entries 

(Xij). 

LEMMA D. 2. If X is a random matrix from Ens+ (3.13), then for all t > 
and any S C {1, . . . \S\ < s, with probability at least 1 — 2exp(— c\t 2 ) — 
exp (— C2 min (t 2 , t) s) 
(E-7) 

-X]X S - Z* ss ) < max(5,<5 2 ) + cJ^, 5 = + * 

n J V n V n v n 

where C, Ci, C2, c, ci, C2 > are universal constants. 
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PROOF. We decompose the rows {Xg}f =1 of Xs as X l s = X l s + fil, where 
fj, > is the mean of the entries, i = 1, . . . , n. We have 

(1 \ 1 n ~ 

-X]X S - X* ss ) = sup - V ({X'g + fil, v) 2 - E[(X| + /il, u) 
n / IMI,=i n ~[ v 



sup 



< sup 

IMIo=l 



1 



i=l 



1 n 

-£(<X^> 2 -E[<A1,t;> 2 ] 



i=l 



+ 2 sup 

ll«lla=l 



1 " ~ 



t=l 



The first summand is handled by an application of Theorem D.2. For the second 
summand, we have 



2 sup 

v. \\v\\ 2 =l 



1 n ~ 
( f ,l,v)-J2(X i s ,', 



i=i 



< 2 



1 n ~ 



1=1 



Re-writing the norm as 



i=l 



1/2 



1/2 



j'GS \ V i=l 



and noting that, as explained in Appendix A, the sub-Gaussian norm of the 
{Zj} is uniformly bounded by an absolute constant, say L, we invoke (E.6), 
which yields for all £ > 



P ( Z j > s + ts ] - exp ^ _cmm J2 1 s 



The claim follows by taking roots and back-substituting. 
Lemma D. 3. 



□ 



max 

i<j,k<p 



1 



X T X - £* 



< C 



logp 
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with probability at least 1 — 3/p — exp(— cn), where C, c > are universal 
constants. 



Proof. Write Xj = Xj — /jl, j = 1, . . . ,p, for the column vectors obtained by 
centering the columns of X. We have 

(E.8) X - ((Xj,X k ) - E[(Xj,X k )]) = ^(Xj,X k ) - Li (±(Xj, 1) + ^(^,1) 

For the second term in (E.8), we have, in view of the properties of sub-Gaussian 
random variables in Appendix A 



(E.9) 



(Xj + X k , 1) > V2fj,z) < 2exp(— cqtiz^ 



For the first term in (E.8), let us first consider the case j ^ k. Fix any j € 
{1, . . . ,p}. It follows from Lemma F.l that the event £j = {{{XjH^ < 2n} holds 
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with probability at least 1 — exp(— c\n). Conditional on £j, (Xj,Xk) is a sub- 
Gaussian random variable with sub-Gaussian norm bounded by L\/n, for some 
universal constant L > 0. It follows that 



(E.10) 



-(Xj,X k ) 



n 



> z < P 



— (Xj,Xk) 



n 



> z 



Sj + P(*£ 



< 2exp(— C2nz 2 /L 2 ) + exp(— c\n) < 2exp(— c^nz 2 ) + exp(— c\n). 



Let now j = k. With the aim to control the first term in (E.8), an application 
of Theorem D.l yields \/z > 



(E.11) 



1 n 
n ^ 



■i"7 



E[4-]) 



> z < 2 exp(— C4 min(z, z 2 )n). 



Combining (E.9), (E.10) and (E.ll), with a union bound over all p 2 entries of 



^X T X and setting z = 2/ ' \J minjco, C3, c^}\J — we obtain 



X T X - 



n 



> c 



logp 



< — h exp(— c\n + logo). 
P 



□ 



Proof of Theorem 6. Equipped with these auxiliary results, we turn to the 
actual proof of the Theorem. We analyze the random scaling of t 2 (S) using the 
dual formulation (3.3). In the following, denote by § s_1 = {11 £ I s : ||u|| 2 = 1} 
the unit sphere in M s . Expanding the square in (3.3), we have 



(E.12) 



t 2 (S) = min 6 T -XJX S 9 -29 T -XJx S c\ + X T -XJ c X S cX 
~" AeTP-"- 1 n n n 



> min r 2 M T S5 S u - r 2 s max ( -X]X S - ^* ss ) 

- 2ru T -XJx S c\ + \ T -XJcX S c\ 

n n 

> min r 2 M T S5 S u - r 2 s max ( -X] X s - T,* ss 

r>0, uGS 8 - 1 , A6TP- 3 - 1 \ n 

- 2pru T l - 2ru T {-X]X S c - Z* SS c)\ + p + 1 



sup 



n p 
A T (—Xg c Xs<: — T>* S c S c)\ 



For the last inequality, we have used that min^g^p-s-i X T T,* SCSC X = p + — - 
by setting A = l/(p — s). We further set A = s max (^XgXs — £55) an d 5 
sup ug§s -i AgT p-s-i \u T (-^XgcXsc — S1j C £ C ) A|. The random deviation terms A 
and S will be controlled uniformly over u € S s_1 and A € t^p-s- 1 by means of 
the two preceding lemmas, and are hence subsequently treated as constants. 
This approach allows us to minimize the lower bound in (E.12) w.r.t. u and r 
separately from A. The minimization problem involving u and r reads 

(E.13) min r 2 u T Z* ss u - 2pru T l - r 2 A - 2r5. 

r>0, u&s- 1 



v- 
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We first derive an expression for 



(E.14) (f)(r) = min r u T,$ s u — 2pru 1. 



We decompose u = + ^i -1- , where ij = ^^=,ity is the projection of u 

on the unit vector 1/y/s, which is the eigenvector of associated with its 
largest eigenvalue 1 + p(s — 1). By Parseval's identity, we have H^'Hl — 7> 
||tt ||| = (1 — 7) for some 7 E [0,1]. Inserting this decomposition and noting 
that the remaining eigenvalues of are all equal to (1 — p), we obtain the 
following expression to be minimized w.r.t. 7 E [0, 1] 

(E.15) r 2 7 (1 + (s- l)/>) +r 2 (l - 7) (W) -2pr^^~s, 

where we have used that (it 1 -, 1) = and that all potential minimizers must 
satisfy > 0. Let us put aside the constraint 7 E [0,1] for a moment. 

The expression (E.15) is a convex function of 7, hence we may find an (un- 
constrained) minimizer 7 by differentiating and setting the derivative equal 
to zero. This yields 7 = -5-, which coincides with the constrained mini- 
mizer if and only if r > Now observe that the minimizer of the problem 

min r> u6 § s -i r 2 u T Tjg S u — 2pru T l with r being unfixed equals the minimizer 
6 of the problem miners 9 T Y>* SS 6 — 2p6 T 1, which is given by 9 = = 



"75 ' i + (gl P i) p ; a un it vector satisfying 7 = 1 times a radius less than 1 / yfs. We 
conclude that for all r < 1/yfs, the minimum is attained for 7 = 1, hence the 
function 0(r) (E.14) is given by 



(E.16) 



r 2 s max (E* ss ) - 2rp^/i r < 1/y/a, 
r 2 (l — p) — p otherwise, 



where the second line is obtained by inserting 7 = -J- for 7 in (E.15). The 
minimization problem (E.13) to be considered eventually reads 

(E.17) m.mip(r), ip(r) = <fi(r) — r 2 A — 2rb. 

r>0 

We argue that it suffices to consider the case r < \ j\fs in (E.16) provided 
(E.18) ((l-p)-A) 2 >,5 2 S , 

a condition we will comment on below. If this condition is met, differentiating 
shows that ip is increasing on [-^, 00). In fact, for all r in that ray, 

4-ip(r) = 2r(l - p) - 2rA - 25, and thus 
dr 



—ijj(r) > for all r £ 
dr 



±=,00) o i=((i-p)-A) >S & ((1- P )-A) 2 >s5 2 . 
\/s / Js 



Considering the case r < 1/y/s, we observe that ip{r) is convex provided 
(E.19) s max (E* ss ) > A, 
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a condition we shall comment on below as well. Provided (E.18) and (E.19) 
hold true, the minimizer r of (E.17) is given by (py/s + S)/(s max (T,g S ) — A). 
Substituting this result back into (E.17) and in turn into the lower bound 
(E.12), one obtains after collecting terms 
(E.20) 



r\S) > p- 



(l-p)-A 



2p^5 + 5 2 1 - p 



(1 - p) + sp - A s 
Consider the two events 



+ 



A p — s 



sup 



A (—X S cXsc — T<* S c S c)\ 



A 



A<d| J s2l °Z 1/2 P + ^P 



■n 



n 



>! 


(...) 


f 

max 









logp 



n 



for universal constants Ci,C 2 > 0. Conditional on An B, bounding 



5 < sup Hul^ sup 



—XgcXsc 
n 



A 



logp 



n 



and inserting the scaling for A under A, there exists a sufficiently large constant 
C > such that the two conditions (E.18) and (E.19) supposed to be fulfilled 
previously indeed hold given that n > C\og(p)s 2 . We may re-write (E.20) as 



(E.21) 

t\S) > 



p(l-A/(l-p)) 



+ 



6 2 /(l + (s-l)p) 



(l_ A/( l_ p) ) + s _p_ i_ A /(l + ( s -l)p) l-A/(l + (*-l)p) 
1 



sup 

AgTP _ s _i 



A (—X S cXsc — T>* S csc)\ 



Conditional on An B, there exists again a sufficiently large constant C > 
such that if n > C\og(p)s 2 



(E.22) 



1 n /logp logp /logp 1 /logp 

ci- - CW C 4 C 2 \ = ci- - C 5 \ 

s V n n Vra s V n 



by inserting the resulting scalings separately for each summand in (E.21), 
where ci, C3, C4, C7g > are universal constants. We conclude that if n > 
max(6, C) \og(p)s 2 , (E.22) holds with probability no less than 1-P(.4)-P(S). 
Using Lemmas F.2 and F.3 to control ~P(A) and P(S), the result follows. 

APPENDIX F: PROPERTIES OF THE CONSTANT w(5) 
In this section, we derive the following three properties, 
(t) w(5) > & r(5) > O A S ]R; is a face of C, 

(n) < 1, with equality if {Xj}j<=s -L {-Xj}jeS c an( A -^J^S c h 

(iii) U)(S) > min — (Z T Z)jj H — min{(Z T Z)jk, 0}, 
i<j'<(p-s) " ^ 

Proof, (i): From (3.3), we have 

(F.l) t 2 (S)= min — IIZAIIo , hence 

AeTf-s- 1 n 2 
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3A e TP' 3 ' 1 s.t. ZA = => Z T ZX = => ||Z T ZA|| 0C = => w(S) = 0. 

On the other hand 

3w e V(F) s.t. HZjZ^lloo = =► = =► r(5) = 0. 

The second equivalence is by the definition of a face of a cone, 
(ii) Consider all principal sub-matrices —ZpZp- By definition, oj(S) equals the 
maximum of the absolute values of the entries of ^Z^Zpy, where one minimizes 
over all v contained in the boundary of the unit cube [0, ly F K We may restrict 
our attention to matrices ^Z T Z which are entry- wise non- negative. To see this, 
assume that there exists a non- negative off-diagonal entry for a pair (j, k). Then 
pick Fq = {j, k} and set V(Fq) = {v G R 2 : v >z 0, \\v\\ = 1} to obtain that 



)(S) < min \\-ZjZ F v\\ O0 <max{-(Z T Z) jj + -(Z T Z) jk , 

— (Z T Z) kk H — (Z T Z)jk\ 
n n J 



n n 

< max\-{Z T Z) jjt -(Z T Z) kk \ < 1, 
[n n J 

re-calling that H-ZjH^ = llng^/lll < H^jH^ = n f° r an 3- If Z~^ Z is entry- 
wise non- negative, a similar argument shows that ui(S) equals the minimum 
diagonal entry of ^Z T Z, which is upper bounded by 1. Using expansion (3.22), 
we may write 

-Z T Z = -XjcX S c XgcX s ( -X]X S ) XgX S c, 

n n n \n J 

orthogonality implies that -Z T Z = -Xj c Xso. Using entry-wise non-negativity 

1 — r 1 1 1 1 2 

of —XgcXgc together with ||^j|| 2 = n for all j, the assertion follows. 

(iii) can be derived directly from the definition of co(S): 

u(S) = min min -ZlZ F v , V(F) = {v G : IblU = 1, t; >- 
$^FC{l,... lP -s}veV(F) 



n 



1 

min min max — 

®¥=FC{l,...,p-s}v&V(F) j£F n 



[Z T Z) jj v j + y^(Z T Z) jk v k 

k+3 



> min min min — (Z Z)n +y(Z Z)j k v k 
~ 9^FC{l,...,p-s} j&F k, o<« fc <r 



n l\(z T z) jj + J2iz T z) 3 

min min— I (Z T Z),;j + min\(Z T Z)i k , 0) 
{ i,..., P - s}j eFny £j 

- I (Z T Z) JJ + ^min{(Z T Z) jfe ,0}( 



mm 

i<i<(p-s) « 



□ 
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APPENDIX G: PROOF OF THEOREM 9 



The proof is along the lines of [22] (proof of their Theorem 3). Note that in 
the setting of Theorem 7 or 8, if the condition imposed on /3 m i Q (S) in 9 holds 
true, the event E = {rj < s, j G S} holds with probability tending to one. 
The probabilities of the two events The probabilities of the two events 

E > = {s > s}, E< = {s< s} 

are controlled separately. We have that 

P-i 

E> C (j{5(k) >d^/2A^n} 

k=s 

Conditional on E, each of the 6(k) = \uJe\, k > s, where Uk is a unit vector that 
spans the linear space Uk = Vk+i fl V k r, where for I = 1, . . . ,p, V\ is the linear 
spanned by the variables with rank no larger than I. Hence, the {S(k)} follow 
sub-Gaussian distributions with expectation 1 and sub-Gaussian parameters 
depending only on the sub-Gaussian parameter a of e and a universal constant, 
so that P(5(k) > "Q\J2 logn) = o(l/n), by virtue of the assumption d > c&. 
Note that S(k) = whenever Vk+i = Vk- Since X spans a subspace of dimension 
no larger than n, 5(k) can be nonzero no more than n times. By a union bound 

over the events \s(k) > d^2 logn) P , P(E > \E) = o(l). 

L J k=s 

The probability of the event E < can be upper bounded by following the proof 
in [22] line by line, while noting that the slightly more general notion of sub- 
Gaussian noise e can be accommodated easily. 



APPENDIX H: PROOF OF THEOREM 10 

Consider the non-negative lasso problem (3.29). It follows from the KKT op- 
timally conditions that any minimizer >£ of (3.29) satisfies 
(H.l) ' 

2Xj(y - Xp^t) = A and (3^'- > 0, 2Xj (y - X^^) < A and f^- = 0. 

First note that Ao is chosen such that the event {^||A rT e|| 00 < Ao} holds with 
probability at least 1 — 2 /p. Subsequently, we work conditional on this event. 
Following the technique employed in [47], we establish that for A > 2Ao/7(5") 
with 7(5") defined by (3.30), the unique minimizer is given by f3g'~ = S5 
and /3 S c^ = 0, where a denotes the minimizer of the following constrained 
non-negative lasso problem 

min -\\y-Xp\\l + \l T p. 

To this end, in view of (H.l), we have to show that the following system of 
inequalities is satisfied 
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Using the lower bound on f3 m i Q (S) in (3.31), as >- and we may resolve 
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Substituting this into (H.2), we find that the following system of inequalities 
must hold true. 

Each component of the left hand side is no larger than 4(1 — 7(«S 1 )) + Ar> It 
follows that for A > 2Xo/j(S), the system of inequalities is indeed fulfilled. 
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